Abstract


Current digital system design practices make heavy use of various types
of memories. RAMs and ROMs are used in main memory, register files, caches, and
microstores. As a result, it becomes important to recognize the implications of
memory chip failure modes for system reliability.

A brief  survey  of  available  memory  chip  failure  mode  data  is  made  and  shows 
that  partial  chip  failures  are  more  prevalent  than  whole  chip  failures.  Based  on  the 
findings  of  this  survey,  reliability  models  for  memory  systems  with  error  coding 
techniques  are  developed.  The  effect  of  memory  support  circuitry  on  memory 
reliability, usually ignored in the development of analytical models, is included. It is
shown that for wide ranges of memory system parameters and memory element
failure rates the memory system reliability is dominated by the effect of the
support electronics. The use of these models in design tradeoff decisions is
explored.

The  performance  of  systems  with  fault  tolerant  memory  when  there  are 
correctable  failures  present,  an  area  which  has  seen  little  work,  is  analyzed. 
Performance  models  for  systems  with  fault  tolerant  main  memory,  as  well  as  those 
with  fault  tolerant  microstore,  are  developed  and  their  properties  explored. 

Hamming code is one of the error correcting techniques considered. Block codes,
commonly  used  for  tape  media  but  rarely  if  ever  for  RAM  or  ROM,  are  also 
considered  and  found  competitive  with  Hamming  codes  in  many  cases. 


This research was supported in part by Digital Equipment Corporation and in part by the
Office of Naval Research under contract N00014-77-C-0103.

1.  Introduction and Overview

Memory  elements  are  currently  being  used  heavily  in  all  areas  of  digital  systems  design. 
They  see  use  as  microstores,  register  files,  caches,  and  of  course,  main  memory. 
Improvement  of  memory  element  reliability,  then,  will  have  the  effect  of  greatly  improving  the 
reliability  of  those  systems  being  designed  now  and  in  the  future. 

A primary  method  of  increasing  memory  reliability  is  the  use  of  error  correction  codes, 
such  as  Hamming  and  block  codes  [Pete72].  These  techniques  result  in  a tolerance  to  single 
bit faults in a memory word. The tradeoff involves system cost, complexity,
performance, serviceability, and reliability (these last two together determine field repair
costs). For an increase in cost and complexity, the memory reliability can be greatly
enhanced with little or no decrease in performance. To help in the tradeoff decisions to be
made during system design, tools for predicting the reliability and the performance
degradation in the presence of errors are needed. The accuracy of these tools can be
increased by examining the failure modes of the memory components used to build the
memory systems. A study, summarized in Chapter 2, was made of the available data on
semiconductor memory chip failure modes. The results of the study were used as the basis
for the reliability models.

The  tools  needed  for  design  decisions  are  developed  in  Chapter  3.  Two  error  correcting 
schemes,  Hamming  codes  and  block  codes,  are  examined.  The  design  tools  are  then 
formulated to be used for any size of single error correcting memory, and include the effects
of the support circuitry needed to complete an entire memory system. Easily calculable
formulae for memory system reliability, MTTF, and hazard function are presented in Chapter
3 and are analyzed in Chapter 4.

Models for the degradation of system performance in the presence of memory component
failures in both main memory and microstore are developed in Chapter 5. The performance
degradation when failures have occurred in the memories of example systems, including the
PDP-11, is analyzed in the final subsections of Chapter 5.

2.  Memory Chip Failure Modes


Data  on  semiconductor  memory  chip  failure  modes  during  operating  life  is  available  from 
few sources. Most semiconductor manufacturers are more interested in the physical
mechanisms  of  failure  than  in  the  functional  characteristics  of  a failure.  What  data  is  available 
comes  mostly  from  screening,  burn-in,  and  to  a lesser  extent,  high-temperature 
accelerated-life  tests.  A summary  of  some  of  the  data  we  have  collected  is  in  Table  2-1. 


TABLE 2-1

Chip Failure Mode Data Summary

                                            -- Percentage of All Failures --
Source      Devices                         Whole    Single   Row/     Not
                                            Chip     Bit      Column   Known

[Texa7?]    4K MOS RAM                        -       92        -        8

[Par.c75]   4K MOS RAM                      11.8     35.3     29.4     23.5
            (burn-in & cell stress
            screening tests)

[Rick76]    varied PROMs                    17.9     53.9     15.3     12.9
            (accel. life tests,
            using some guessing)

[Goar76]    8K MOS UV PROM                  100.0 [column not legible in source]
            (700k device hours in
            accel. life testing)

The data shows that memory chip failure modes are, unsurprisingly, dependent on
technology, process, and device design and thus may vary widely. Failure mode distributions
also change with time for a given device as the fabrication process matures*. Nevertheless,
there is good evidence that the whole chip failure modes (i.e. complete inability to store
and/or retrieve data) do not dominate for most devices. Single bit, row, and column failure
modes appear to be the result of the majority of device failures. This fact motivated the
formulation of the error correcting code (ECC) memory models presented in the following two
sections.

*The TI data indicates that 92% of the failures observed were single bit failures. A
conversation with a TI reliability engineer revealed that recent tests show only about half of
the failures observed were single bit failures, reflecting process improvements since [Texa7?]
was released. Although these tests were conducted for operating periods short relative to
actual field operating life, the TI engineer felt that long term field data would still show a
dominant percentage of partial array failures.

3.  Memory Organizations and Their Reliability Models

Wang  and  Lovelace  [Wang77]  present  a model  for  main  memory  reliability,  based  on  the 
use  of  4096  bit  chips  in  a 16  bit  per  word  memory  system  using  a Hamming  single  error 
correcting/double  error  detecting  code.  Another  model  by  Levine  and  Meyers  [Levi76]  uses 
charts  and  tables  to  determine  the  reliability  of  Hamming  coded  memories.  The  model  is 
based  on  the  whole  chip  failure  mode.  Neither  model  allows  for  the  effect  of  the  control 
reliability  on  the  total  memory  system  reliability.  The  models  presented  and  examined  in  this 
paper cover any single error correction scheme for any size memory, and are developed in
such a way that the reliability of all the control, correction, and interface circuitry for the
memory element is included, thus modelling the reliability of the entire memory system.
Further, a new formula is derived that can be used to efficiently calculate mean time to failure
(MTTF) of any of the various models.

In  this  section  three  models  for  error-correcting  code  (ECC)  memory  reliability  are 
presented.  Each  model  is  based  on  a different  assumption  of  dominant  memory  chip  failure 
mode.  Two  of  them  provide  upper  and  lower  bounds  for  the  reliability  of  an  ECC  memory. 
For  comparison  we  present  a model  for  the  non-redundant  memory.  All  of  the  models  assume 
that  component  failures  in  the  memory  support  circuitry  cannot  be  survived. 

To  develop  the  reliability  models  the  properties  of  two  error  correcting  schemes, 
Hamming  codes  and  block  codes,  are  examined.  One  of  the  measures  to  be  used  is  mean  time 
to failure (MTTF). MTTF is used widely in design trade-off decisions and in such business
planning activities as availability and life cycle cost estimation. The other measure will be the
hazard function z(t), which expresses the instantaneous failure rate at time t. The hazard
function is not only easier to measure in practical situations but its shape can also say
something about the shape of the reliability function.

3.1.  Single  Error  Correcting  Memory  Properties 

The  ECC  memory  reliability  models  depend  on  the  properties  of  the  single  error 
correcting  schemes  used.  In  examining  the  Hamming  and  block  code  ECC  schemes,  two  types 
of memory words are considered. The first is called a logical word and is the word that the
system using the memory needs. Error correction in the memory itself is done for physical
words, which are made up of one or more logical words in addition to whatever coding bits
are needed.

For Hamming codes a b bit word has c coding bits (which may or may not include the
extra bit for double error detection) added to it. The total number of bits is k = b + c.
Several logical words may be combined into a larger physical word for error encoding, thus
decreasing the total number of coding bits in the memory. If j logical words go into a physical
word that includes e coding bits, the physical word size becomes k = bj + e, and the number
of physical words in an x logical word memory is w = x/j. For a complete explanation of
Hamming codes, see [Pete72].
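
The word-size bookkeeping above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the number of coding bits is computed from the standard Hamming bound (smallest r with 2^r >= data bits + r + 1) plus one overall parity bit for double error detection, which the text does not spell out and is assumed here. Under that assumption the sketch reproduces the physical word sizes 22, 39, and 72 that appear in Table 4-2.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def physical_word_bits(b: int, j: int = 1, double_error_detect: bool = True) -> int:
    """Physical word size k = b*j + e for j logical words of b bits each."""
    data_bits = b * j
    # Assumption: double error detection costs exactly one extra parity bit.
    e = sec_check_bits(data_bits) + (1 if double_error_detect else 0)
    return data_bits + e


def physical_words(x: int, j: int) -> int:
    """Number of physical words w = x / j in a memory of x logical words."""
    if x % j != 0:
        raise ValueError("x must be a multiple of j")
    return x // j


if __name__ == "__main__":
    for b, j in [(16, 1), (16, 2), (16, 4), (32, 1), (64, 1)]:
        print(f"b={b:2d} j={j}  k={physical_word_bits(b, j)}")
```

Combining logical words (larger j) lowers the coding overhead per logical word, at the price of the wider data paths and extra control logic discussed in Chapter 4.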

Block  codes  are  widely  used  for  tape  media-based  memory  systems,  but  have  seen  little 
or no use in other types of memories. In this scheme, each word has a parity bit appended
(horizontal parity bit) and j words of b bits are grouped together to form a block. Each block
has an extra word associated with it, each of whose (b+1) bits is the parity bit for the
appropriate bit slice of the block (vertical parity bits). The total number of bits in the
physical word is k = (b+1)(j+1), and for an x logical word memory there are w = x/j
physical words. In the case of a single error, a horizontal parity error is found, and a
vertical parity word is reconstructed. The intersection of the horizontal parity error and
vertical parity error pinpoints the erroneous bit. This method of coding also allows double
errors to be detected, although not recovered from. Because of the horizontal parity, the
block code suffers no degradation relative to the Hamming code in error detection. Correction,
however, is slower since the vertical parity of the whole block has to be calculated. Since
correction occurs infrequently for transient errors this slow correction is not a penalty. Even
in the presence of hard failures, the block code suffers very little performance degradation, as
illustrated in Chapter 5.
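
The single-error case described above can be sketched as follows. This is a minimal illustration, not the hardware organization of the paper: bits are held in Python lists, even parity is assumed, the function names are hypothetical, and only an error in one of the j data words (not in the vertical parity word itself) is handled.

```python
from functools import reduce


def parity(bits):
    """Even-parity value (XOR) of a sequence of 0/1 bits."""
    return reduce(lambda a, b: a ^ b, bits, 0)


def encode_block(words):
    """words: list of j lists of b bits. Returns (rows, vertical).

    rows     -- each word with its horizontal parity bit appended (b+1 bits)
    vertical -- the extra word holding one parity bit per bit slice (b+1 bits)
    """
    rows = [list(w) + [parity(w)] for w in words]
    vertical = [parity(col) for col in zip(*rows)]
    return rows, vertical


def correct_single_error(rows, vertical):
    """Locate and flip a single erroneous bit in rows, in place.

    Returns (row, column) of the corrected bit, or None if no row shows a
    horizontal parity error. Raises ValueError when more than one row shows
    a parity error (detected but not correctable).
    """
    bad_rows = [i for i, row in enumerate(rows) if parity(row) != 0]
    if not bad_rows:
        return None
    if len(bad_rows) > 1:
        raise ValueError("multiple bit failure: detected but unrecoverable")
    # Reconstruct the vertical parity word and compare with the stored one.
    recomputed = [parity(col) for col in zip(*rows)]
    bad_cols = [c for c, (new, old) in enumerate(zip(recomputed, vertical)) if new != old]
    if len(bad_cols) != 1:
        raise ValueError("parity mismatch does not identify a single bit")
    r, c = bad_rows[0], bad_cols[0]
    rows[r][c] ^= 1  # the intersection pinpoints the erroneous bit
    return r, c


if __name__ == "__main__":
    words = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
    rows, vertical = encode_block(words)
    rows[1][2] ^= 1  # inject a single bit error
    print(correct_single_error(rows, vertical))
```

With b = 4 and j = 3 the block occupies (b+1)(j+1) = 20 bits, matching the expression for k given above.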

Both  the  Hamming  coded  and  block  coded  memories,  then,  have  k bit  physical  words  and 
w physical  words  in  the  memory.  The  only  difference  between  the  two  schemes  as  far  as  the 
model is concerned is that these values are different. In each case, the memory can tolerate
no more than one failure in the k bits of a given word in a w word memory. This common
property is the one upon which the following development is based.

3.2.  Single  Error  Correcting  Memory  Models 

The  first  ECC  memory  model  assumes  that  single  memory  bit  cell  failures  dominate,  and 
provides  an  upper  bound  on  reliability.  The  second  assumes  that  the  dominant  failure  mode 
is  complete  functional  failure  of  memory  chips,  and  provides  a lower  bound.  Between  these 
two  extremes  lie  row  and  column  failures  in  the  arrays  internal  to  the  chips,  and  distributions 
of  whole  chip,  single  cell,  and  row/column  failures.  For  completeness  a third  model  for  ECC 
memory reliability is presented. This model assumes that the row (column) failure mode is the
dominant failure mode for memory devices.

3.2.1.  Single  Bit  Failure  Mode  (SBFM)  Model 

Single bit cell failures are assumed to be independent events, with each cell following the
exponential failure law with failure rate $\lambda_b$. The reliability function for a single bit cell is
then

$$R_b(t) = e^{-\lambda_b t}$$

Each k bit word can tolerate the failure of a single bit. Thus the reliability of a given
word is:

$$R_w(t) = R_b^k + k (1 - R_b) R_b^{k-1}$$

For a w word memory the array reliability is

$$R_{a,sb}(t) = \left[ k R_b^{k-1} - (k-1) R_b^k \right]^w$$

Fault-free operation of the memory requires the selection, control, and decoding circuitry
to be functioning correctly. It is assumed that these also follow exponential failure processes
with failure rates $\lambda_s$, $\lambda_k$, and $\lambda_d$ respectively. The reliability of the complete memory is then
expressed as:

$$R_{m,sb}(t) = e^{-(\lambda_s + \lambda_k + \lambda_d) t} \left( k e^{-(k-1)\lambda_b t} - (k-1) e^{-k \lambda_b t} \right)^w \tag{3-1}$$
The mean time to failure (MTTF) is defined [Shoo68] by

$$\mathrm{MTTF} = \int_0^\infty R(t)\, dt$$

The mean time to failure of the memory is:

$$\mathrm{MTTF}_{sb} = \int_0^\infty e^{-\lambda_e t} \left( k e^{-(k-1)\lambda_b t} - (k-1) e^{-k \lambda_b t} \right)^w dt
= \frac{1}{\lambda_b} \left( \frac{1}{f_0} + \frac{k g_0}{f_0 f_1} + \cdots + \frac{k^w g_0 \cdots g_{w-1}}{f_0 \cdots f_w} \right) \tag{3-2}$$

where

$\lambda_b$ = memory bit failure rate,
$\lambda_e = \lambda_s + \lambda_k + \lambda_d$,
$f_i = wk + \lambda_e/\lambda_b - i$,
and $g_i = w - i$.

The MTTF of the memory array alone is obtained by setting $\lambda_e/\lambda_b = 0$. A detailed
derivation of equation (3-2) is given in Appendix I. It is significant to note that MTTF
calculations for ECC memories were extremely tedious (e.g. Monte Carlo simulations or
numerical integrations were used) before the derivation of equation (3-2).
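
As an illustration of how cheaply equation (3-2) can be evaluated, the sketch below computes the finite sum term by term and, purely as a cross-check, integrates the reliability function of equation (3-1) numerically. The function names, the trapezoid rule, and the parameter values in the example are assumptions made for this illustration; the argument `n` stands for w here and can equally stand for h or p in the later models.

```python
import math


def mttf_series(k: int, n: int, lam_unit: float, lam_e: float) -> float:
    """Finite sum of equation (3-2): (1/lam_unit) * sum_i k^i g_0..g_{i-1} / (f_0..f_i)."""
    ratio = lam_e / lam_unit
    term = 1.0 / (n * k + ratio)          # i = 0 term: 1 / f_0
    total = term
    for i in range(1, n + 1):
        g_prev = n - (i - 1)              # g_{i-1} = n - (i - 1)
        f_i = n * k + ratio - i           # f_i = n*k + lam_e/lam_unit - i
        term *= k * g_prev / f_i          # extends the running product by one factor
        total += term
    return total / lam_unit


def mttf_numeric(k, n, lam_unit, lam_e, t_max, steps=200_000):
    """Trapezoid-rule integral of R(t) from equation (3-1) over [0, t_max]."""
    def reliability(t):
        x = math.exp(-lam_unit * t)
        return math.exp(-lam_e * t) * (k * x ** (k - 1) - (k - 1) * x ** k) ** n

    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for s in range(1, steps):
        total += reliability(s * h)
    return total * h


if __name__ == "__main__":
    # Illustrative values only: 22 bit words, 4 words, rates in arbitrary units.
    print(mttf_series(22, 4, 1.0, 2.0))
    print(mttf_numeric(22, 4, 1.0, 2.0, t_max=20.0))
```

The series needs only n + 1 terms, whereas the numerical integral needs many evaluations of R(t) and a choice of truncation point.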


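The hazard function z(t) of a system is defined [Shoo68] by

$$z(t) = \frac{f(t)}{R(t)} \tag{3-3}$$

where f(t) is the failure density function

$$f(t) = -\frac{d}{dt} R(t).$$

The hazard function for the SBFM model can be shown to be

$$z_{sb}(t) = \lambda_e + \lambda_b\, w\, k (k-1)\, \frac{1 - e^{-\lambda_b t}}{k - (k-1) e^{-\lambda_b t}}$$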
3.2.2.  Whole Chip Failure Mode (WCFM) Model


The  whole  chip  failure  mode  model  is  similar  to  the  SBFM  model  with  only  a few 
differences.  Since  a single  error  correcting  memory  architecture  with  more  than  one  bit  per 
word  per  chip  is  not  tolerant  of  whole  chip  failures,  it  is  assumed  that  words  are  distributed 
over  the  chips  in  such  a way  that  no  word  has  more  than  one  bit  on  the  same  chip.  Thus  for 
a memory with k bit words implemented with d bit chips, the parameter w (number of words
in  memory)  in  the  SBFM  model  is  transformed  to  h = w/d.  In  effect  the  memory  is  organized 
into  rows  of  k chips  each,  every  row  containing  d words;  h is  then  the  number  of  such  rows. 
The  MTTF  is  then  expressed  by 


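$$\mathrm{MTTF}_{wc} = \int_0^\infty e^{-\lambda_e t} \left( k e^{-(k-1)\lambda_c t} - (k-1) e^{-k \lambda_c t} \right)^h dt
= \frac{1}{\lambda_c} \left( \frac{1}{f_0} + \frac{k g_0}{f_0 f_1} + \cdots + \frac{k^h g_0 \cdots g_{h-1}}{f_0 \cdots f_h} \right) \tag{3-4}$$

where $\lambda_c$ = memory chip failure rate,
$\lambda_e = \lambda_s + \lambda_k + \lambda_d$,
$f_i = hk + \lambda_e/\lambda_c - i$,
and $g_i = h - i$.

When $\lambda_e/\lambda_c = 0$ this is an expression for memory array MTTF.
The WCFM model hazard function is

$$z_{wc}(t) = \lambda_e + \lambda_c\, h\, k (k-1)\, \frac{1 - e^{-\lambda_c t}}{k - (k-1) e^{-\lambda_c t}} \tag{3-5}$$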
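
The closed-form hazard function can be checked against the definition in equation (3-3), since $f(t)/R(t) = -\frac{d}{dt} \ln R(t)$. The sketch below compares the two numerically. The function names, the central-difference step, and the sample parameter values are assumptions made for this illustration; `n` stands for h in the WCFM model (or w, p in the other two models) and `lam_unit` for the matching unit failure rate.

```python
import math


def reliability(t, k, n, lam_unit, lam_e):
    """R(t) for a single error correcting memory with n groups of k units each."""
    x = math.exp(-lam_unit * t)
    return math.exp(-lam_e * t) * (k * x ** (k - 1) - (k - 1) * x ** k) ** n


def hazard_closed_form(t, k, n, lam_unit, lam_e):
    """Closed-form hazard in the shape of equation (3-5)."""
    x = math.exp(-lam_unit * t)
    return lam_e + lam_unit * n * k * (k - 1) * (1.0 - x) / (k - (k - 1) * x)


def hazard_numeric(t, k, n, lam_unit, lam_e, dt=1e-6):
    """z(t) = -d/dt ln R(t), approximated by a central difference."""
    upper = math.log(reliability(t + dt, k, n, lam_unit, lam_e))
    lower = math.log(reliability(t - dt, k, n, lam_unit, lam_e))
    return -(upper - lower) / (2.0 * dt)


if __name__ == "__main__":
    # Illustrative values: k = 72, h = 2, rates in failures per million hours,
    # t in millions of hours.
    args = (72, 2, 0.2, 20.39)
    for t in (0.01, 0.02, 0.05):
        print(t, hazard_closed_form(t, *args), hazard_numeric(t, *args))
```

At t = 0 the array term vanishes and the hazard equals the support circuitry rate $\lambda_e$; it then rises as the array uses up its single-error tolerance.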
3.2.3.  Row (Column) Failure Mode (RFM) Model


This model is also similar to the SBFM model. There is, as in the previous model, a
restriction on the memory architecture, albeit not as stringent. Having more than one bit per
word per chip is possible as long as there is no more than one bit per word per row (column)
internal to the chip. If this condition is not met, this model should be replaced by the whole
chip failure mode model.


For a w word memory of k bit words implemented with d bit memory chips having q bits
per row (column), w of the SBFM model is replaced by p = w*q/d, which is the number of one
word  wide  sets  of  rows  (columns)  in  the  memory  architecture.  The  MTTF  is  then  expressed 
by 


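$$\mathrm{MTTF}_{r} = \int_0^\infty e^{-\lambda_e t} \left( k e^{-(k-1)\lambda_r t} - (k-1) e^{-k \lambda_r t} \right)^p dt
= \frac{1}{\lambda_r} \left( \frac{1}{f_0} + \frac{k g_0}{f_0 f_1} + \cdots + \frac{k^p g_0 \cdots g_{p-1}}{f_0 \cdots f_p} \right)$$

where $\lambda_r$ = row (column) failure rate,
$\lambda_e = \lambda_s + \lambda_k + \lambda_d$,
$f_i = pk + \lambda_e/\lambda_r - i$,
and $g_i = p - i$.

The hazard function for the RFM model is

$$z_r(t) = \lambda_e + \lambda_r\, p\, k (k-1)\, \frac{1 - e^{-\lambda_r t}}{k - (k-1) e^{-\lambda_r t}}$$


3.3.  Non-Redundant Memory Model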

The  model  for  non-redundant  memory  is  based  on  the  assumptions  that  components  have 
exponential  failure  processes  and  that  any  component  failure  results  in  complete  memory 
failure. The control, selection, and storage array circuitry have failure rates $\lambda_k$, $\lambda_s$, and
$\lambda_a$ respectively. The reliability of the array is then expressed by

$$R_{a,nr}(t) = e^{-\lambda_a t}$$

and that of the entire memory by

$$R_{m,nr}(t) = e^{-(\lambda_{enr} + \lambda_a) t} \tag{3-6}$$

where $\lambda_{enr} = \lambda_k + \lambda_s$.

The MTTF of the memory is

$$\mathrm{MTTF}_{nr} = \frac{1}{\lambda_{enr} + \lambda_a} \tag{3-7}$$

The non-redundant memory has the constant hazard function

$$z_{nr}(t) = \lambda_{enr} + \lambda_a. \tag{3-8}$$
4.  ECC Memory Reliability Exploration via the Models

The  single  bit  failure  mode  (SBFM),  whole  chip  failure  mode  (WCFM),  and  nonredundant 
(NR)  memory  models  are  compared  in  this  section.  The  measures  used  are  MTTF,  the  hazard 
function  z(t),  and  the  reliability  function  R(t). 

Where  specific  values  for  memory  chip  reliability  are  used,  they  are  based  on  the  failure 
rate  range  for  4096  bit  chips  found  in  Table  4-1.  The  ranges  in  Table  4-1  cover  observed 
failure  rates  for  state-of-the-art  chips.  The  reliabilities  of  control  circuitry  for  error 
correcting and nonredundant memories are derived from models for the memories depicted in
Figure 1, assuming the use of standard SSI/MSI logic. The memories modeled in
Figure 1 are assumed to be "bare bones" memories of relatively simple design. Figure 1a
depicts a nonredundant b bit memory of w words. Hamming single error correcting
capabilities are added to it as shown in Figure 1b by increasing the array size to include the
coding bits. Extra control and data manipulation facilities (e.g. MUXes, parity trees, XORs,
registers, etc.) are added to perform error correction and detection, as well as error coding
when writing into the memory. When j logical words are combined into a larger physical
word to limit the increase in array size, extra logic in the form of wider data paths, more
complex coding/decoding circuitry, and a final one-of-j switch is needed.

TABLE 4-1

Memory Chip Failure Rates
chip: 4096 bit, 1 bit per word

failure rate

0.0000122
0.0000488
0.000122
0.000732
0.00122

Figure 1c.  Block-coded RAM model

The block coded memory is shown in Figure 1c. The coding/decoding logic for block
coding is less complex than for a Hamming code. For example, only one parity tree is needed
where the Hamming coded memory needs several. Also the block code requires fewer
redundant bits than the Hamming code. The block code works in the following manner. When
a word is read and XORed with zeros being fed into the other leg of the XOR array (0 is the
XOR identity element), the horizontal parity is calculated by the parity tree. If there is an
error, the vertical parity for the block is calculated by successively XORing words from the
memory block with what is already in the register. Note that the vertical parity word could
be stored in a register file outside of the linear memory address space. The results of the
new vertical parity point to the bit in error. If more than one horizontal parity bit in the
block indicates an error, a multiple bit failure has occurred and the error is unrecoverable.
In the case of a write, the horizontal parity is calculated and the vertical parity is updated
simply by XORing the new and old data words with the old vertical parity word. Since writes
to memory occur only 10-30% of the time, degradation due to vertical parity update is small.
However, the block code is particularly effective for read only memory since the extra
complication on writes is not necessary. Note that the vertical parity word could be stored in
a separate memory array, thus allowing the update of the vertical parity word to proceed in
parallel with the data write.
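
The write path described above can be sketched in software. This is an illustration under stated assumptions, not a model of the circuitry in Figure 1c: stored words are Python integers with the horizontal parity bit placed in the least significant position, even parity is used, and all function names are hypothetical.

```python
def parity_bit(value: int) -> int:
    """Even-parity bit of an integer: 1 if it has an odd number of 1 bits."""
    return bin(value).count("1") & 1


def store_format(data: int, b: int) -> int:
    """(b+1)-bit stored word: b data bits followed by the horizontal parity bit."""
    data &= (1 << b) - 1
    return (data << 1) | parity_bit(data)


def vertical_parity(block: list) -> int:
    """XOR of all stored words in the block, starting from 0 (the XOR identity)."""
    acc = 0
    for stored in block:
        acc ^= stored
    return acc


def write_word(block: list, vertical: int, index: int, new_data: int, b: int) -> int:
    """Write one word and return the updated vertical parity word.

    The update XORs the old and new stored words into the old vertical
    parity word, so the rest of the block does not have to be read.
    """
    new_stored = store_format(new_data, b)
    vertical ^= block[index] ^ new_stored
    block[index] = new_stored
    return vertical


if __name__ == "__main__":
    b = 16
    block = [store_format(d, b) for d in (0x1234, 0xBEEF, 0x0000, 0xFFFF)]
    v = vertical_parity(block)
    v = write_word(block, v, 2, 0xCAFE, b)
    print(v == vertical_parity(block))
```

The incremental update costs one extra read of the old word per write, which is why the text treats the write penalty as small and notes that it disappears entirely for read only memory.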
Block coding of small memories presents some problems because of the relatively large
physical word size and small number of physical words in the memory. If whole chip failures
are to be tolerated, the chips have to be small in size and large in number.

TABLE 4-2

Control Circuitry Failure Rates

log.            phys.    log.
word            word     memory     --- failure rates ---
size      j     size     size        NR      BCC      ECC

16        1      22      32K        1.39      -       9.02
16        1      22      64K        1.39      -       9.02
16        2      39      64K        1.39      -      13.35
16        4      72      64K        1.39      -      22.67
16       16     289      64K        1.39     4.39      -
32        1      39      16K        1.61      -      12.81
32        1      39      32K        1.61      -      12.81
32        1      39      64K        1.61      -      12.81
64        1      72       8K        2.06      -      20.39
64        1      72      32K        2.06      -      20.39
64        1      72      64K        2.06      -      20.39

The resulting control reliabilities are summarized in Table 4-2, and the detailed
design/reliability derivations can be found in Appendix APPB. All failure rates are in units of
failures per million hours.

4.1.  MTTF 


To make MTTF comparisons of the SBFM and WCFM models, a normalized MTTF is used.
This is done to avoid dependence on specific reliabilities of the current or any other
technology, and was accomplished by multiplying the MTTF formulae from Sections 3.2.1 and
3.2.2 by $\lambda_b$. When this is done the MTTF becomes a function of the ratio $\lambda_e/\lambda_b$, instead of
being a function of $\lambda_e$ and $\lambda_b$. The MTTF of the memory becomes

$$\mathrm{MTTF}_{sb,norm} = \frac{1}{f_0} + \frac{k g_0}{f_0 f_1} + \cdots + \frac{k^w g_0 \cdots g_{w-1}}{f_0 \cdots f_w}$$

for the SBFM model, and

$$\mathrm{MTTF}_{wc,norm} = \frac{\lambda_b}{\lambda_c} \left( \frac{1}{f_0} + \frac{k g_0}{f_0 f_1} + \cdots + \frac{k^h g_0 \cdots g_{h-1}}{f_0 \cdots f_h} \right)$$

for the WCFM model, with $f_i$ and $g_i$ as defined for equations (3-2) and (3-4) respectively.
Note that $\mathrm{MTTF}_{wc,norm}$ is still dependent on the number of bits per
chip. The control circuitry MTTF can also be normalized using multiplication by $\lambda_b$, and
becomes

$$\mathrm{MTTF}_{e,norm} = \frac{\lambda_b}{\lambda_e}$$

It is also possible to normalize the nonredundant memory MTTF in the same way,
presuming that the ratio $r = \lambda_{enr}/\lambda_e$ is known. The normalized MTTF for the nonredundant
memory becomes

$$\mathrm{MTTF}_{nr,norm} = \frac{1}{r (\lambda_e/\lambda_b) + w b}$$


In Figure 2 the normalized MTTF curves are plotted against the ratio $\lambda_e/\lambda_b$. These
curves are for 16 bit logical word memories of 16K and 64K words, using both the SBFM and
WCFM (assuming 1024 bits per chip) ECC models and the nonredundant memory model.


Figure 2 illustrates a factor of 20-25 difference in MTTF prediction for the SBFM over
the WCFM model for small values of $\lambda_e/\lambda_b$, with the size memories modeled. As
$\lambda_e/\lambda_b$ increases, the ECC memory MTTF becomes essentially that of the support circuitry
(which would plot as a line with unit slope going through the origin). Thus the limiting factor
on the memory reliability is the support circuitry reliability. The plot also shows that the
ratio $\lambda_e/\lambda_b$ at which the array reliability can be ignored in computing MTTF is lower for the
SBFM model than for the WCFM model. This difference becomes greater for larger chip size.
For $\lambda_e$ in the range from 1 to 100 this corresponds to a $\lambda_e/\lambda_b$ of roughly $10^3$ to $10^7$ for the
$\lambda_b$ values in Table 4-1. This is well into the range where the SBFM assumption shows that
the memory reliability can be modeled as simply that of the support circuitry, and just at or
below that range for the WCFM assumption.

COMPARISON OF SBFM, WCFM, AND NR MTTF MODELS

Figure 2


The normalized MTTF for the nonredundant memory (assuming r = 0.1) is also plotted in
Figure 2. It shows the same behavior as the ECC memories, i.e. the MTTF is limited by the
control circuitry MTTF, although at a higher value of $\lambda_e/\lambda_b$. It also points up the fact that by
the time that

$$\frac{\lambda_e}{\lambda_b} = \frac{w b}{1 - r}$$

nonredundant memory becomes more reliable than ECC memory, and that for large $\lambda_e/\lambda_b$,
its MTTF is greater by the factor 1/r.
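
The crossover point can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the ECC memory is approximated by its support circuitry alone (normalized MTTF of $\lambda_b/\lambda_e$), which is the large-ratio limit described in the text rather than the full ECC model.

```python
import math


def mttf_nr_norm(x: float, w: int, b: int, r: float) -> float:
    """Normalized nonredundant MTTF, with x = lambda_e / lambda_b."""
    return 1.0 / (r * x + w * b)


def mttf_control_norm(x: float) -> float:
    """Normalized MTTF of the ECC support circuitry alone (large-x limit of ECC)."""
    return 1.0 / x


def crossover_ratio(w: int, b: int, r: float) -> float:
    """Value of lambda_e / lambda_b at which the two expressions above are equal."""
    return w * b / (1.0 - r)


if __name__ == "__main__":
    w, b, r = 16 * 1024, 16, 0.1
    x0 = crossover_ratio(w, b, r)
    print(x0, mttf_nr_norm(x0, w, b, r), mttf_control_norm(x0))
    # Far beyond the crossover the advantage approaches 1/r.
    print(mttf_nr_norm(1e12, w, b, r) / mttf_control_norm(1e12))
```

Setting the two expressions equal gives $x = r x + w b$, which rearranges to the ratio stated above.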
In summary, the formulas and derived curves, such as Figure 2, can be used to select the
appropriate memory organization as a function of $\lambda_e/\lambda_b$ and failure mode assumptions.

4.2.  The  Hazard  Function 

The  hazard  function  z(t)  expresses  the  instantaneous  failure  rate  of  a population. 
Mathematically  it  is  related  to  the  reliability  function  by  equation  (3-3).  At  a given  time  it 
measures  the  ratio  of  the  instantaneous  rate  of  change  in  reliability  to  the  current  reliability. 
A constant hazard function implies that the percentage change in reliability is constant
through time, thus the corresponding reliability function is exponential. An increasing hazard
function implies that the percentage change in reliability grows larger with time, and can be
thought of as accelerating (rather than just increasing) unreliability. An increasing hazard
function  is  inherent  for  redundant  systems.  Intuitively,  as  a redundant  system  approaches 
the  limit  of  its  tolerance  to  failures  it  becomes  more  unreliable  than  it  was  when  new. 

Based  on  the  specific  failure  rates  in  Table  2,  the  hazard  functions  for  32  bit  logical  word 
memories  of  16K  and  64K  words  were  calculated  for  both  ECC  SBFM  and  WCFM  models,  as 
well  as  for  the  nonredundant  memory  model.  The  results  are  plotted  in  Figure  3. 

For  the  SBFM  model  the  hazard  is  nearly  constant  for  the  eighty  years  shown,  and  the 
two  differently  sized  memories  exhibit  an  almost  total  hazard  function  dominance  by  the 
control circuitry's constant hazard function $z(t) = \lambda_e$. The WCFM model exhibits much
different behavior for this ratio of $\lambda_e/\lambda_b$. For both sizes of memory the hazard functions
increase throughout the eighty years, with a rapid rise in the first 10 to 20 years as the
memory array hazard function grows and eventually dwarfs the contribution of the control
circuitry's constant hazard function. At the end of 15 to 25 years the WCFM models have
larger hazards than the models for the nonredundant memories of the same (logical) size.
These latter also exhibit constant hazard functions, which for larger size memories are
dominated by the greater constant hazard of the memory array alone ($\lambda_a > \lambda_{enr}$).

The SBFM model hazard function is the same in form as the WCFM model hazard function,
but exhibits different behavior for the same values of $\lambda_e/\lambda_b$, as seen in Figure 3. Figure 4
demonstrates the effect of varying $\lambda_b$ while holding $\lambda_e$ constant (e.g. more reliable memory
for the same control technology, thus increasing $\lambda_e/\lambda_b$). For larger $\lambda_b$ the memory array
hazard function becomes more important and the SBFM model begins to exhibit the same
qualities seen in Figure 3 for the WCFM model. Below some $\lambda_b$ the nonredundant memory
model has a consistently lower hazard function than the SBFM model, as shown by the lowest
curve in Figure 4 (the nonredundant memory models for $\lambda_b \geq 0.000732$ are well above the
range of the Figure 4 plots).

SBFM, WCFM, AND NR MODEL HAZARD FUNCTION COMPARISON

Figure 3

Figure 4

The effect of logical word size on memories of the same size (in words) and the effect of
logical memory size (in terms of the total number of bits) are shown in Figure 5. An increase
in word size and/or memory size causes a corresponding increase in the initial hazard (i.e.
that of the control alone), while the increase in the number of memory cells causes a larger
dependence on the memory array hazard function in spite of the increase in $\lambda_e$.

Block  and  Hamming  coded  memories  are  compared  in  Figure  6.  The  three  upper  curves 
are  for  Hamming  coded  memories  with  1,  2 and  4 logical  words  per  physical  word.  The  major 
effect of saving on memory chips by combining logical words is to decrease memory
reliability.  Since  the  control  logic  failure  process  is  dominant,  adding  control  logic  simply 
increases  the  hazard  function  by  a constant  (corresponding  to  a decrease  in  reliability  of  the 
Hamming  code  memory). 


EFFECT OF LOGICAL WORD SIZE ON HAZARD FUNCTION
(curves: 64 bit, 32 bit, and 16 bit logical words; 64K word SBFM model, $\lambda_b$ = 0.000122)

Figure 5

z(t) FOR LOGICAL WORD SIZE < PHYSICAL WORD SIZE

Figure 6

The block coded memory behaves differently. Because its control circuitry is less
complex than that of any of the Hamming memories, it has a lower initial hazard function. The
slope of the hazard function shows the effect of the greater inherent unreliability of the
larger word size. Even so, the block code memory does not become the less reliable design at
any point in the eighty years, because its hazard function never gets as large as the Hamming
code hazard functions.

4.3.  MTTF,  the  Hazard  Function,  and  Reliability:  an  Example 

This subsection brings together all of the tools developed so far to help in a decision
between a nonredundant and a Hamming coded memory. Specifications for the two alternate
architectures are given in Table 4-3.

TABLE 4-3

Alternate Memory Architecture Specifications

                          NR            ECC

word size                 64 bits       72 bits
memory size               8192 words    8192 words
control failure rate      2.06          20.39

memory chips
  failure rate            0.2
  size                    4096 bits
  dominant fail mode      whole chip

Equations (3-7) and (3-4) are used to calculate MTTFs for the NR and ECC memories. The
NR memory has a slightly higher MTTF (36,153 hrs vs 35,800 hrs). However, the hazard
function for the ECC memory (equation (3-5)) is less than that for the NR memory (equation
(3-8)) for the first 2 3/4 years, as illustrated at the top of Figure 7. When the reliability
functions for the memories are computed using equation (3-6) for the NR architecture and the
WCFM equivalent of equation (3-1) for the ECC architecture, the ECC memory proves to be
more reliable by several percent over the first few years of operation.
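
The comparison above can be checked numerically. The sketch below is an illustration, not the report's own computation: it assumes the Table 4-3 rates are failures per million hours, that the chips are organized 4096 x 1 (so the 8192 word memory is two rows of chips), and that the ECC hazard follows from the reliability integrand given in Appendix I with chip parameters substituted. The constant and function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0

# Table 4-3 values, assumed here to be failures per million hours.
LAMBDA_E_NR = 2.06e-6     # nonredundant control circuitry
LAMBDA_E_ECC = 20.39e-6   # Hamming (ECC) control circuitry
LAMBDA_C = 0.2e-6         # one memory chip, whole-chip failure mode
ROWS = 8192 // 4096       # assumes 4096 x 1 chips, so two rows of chips


def hazard_nr(bits_per_word=64):
    """Constant hazard of the NR memory: control plus every chip in series."""
    return LAMBDA_E_NR + bits_per_word * ROWS * LAMBDA_C


def hazard_ecc(t_hours, k=72):
    """Hazard -R'(t)/R(t) of the ECC memory, where each row of k chips
    survives any single chip failure."""
    q = math.exp(-LAMBDA_C * t_hours)
    per_row = k * (k - 1) * LAMBDA_C * (1.0 - q) / (k - (k - 1) * q)
    return LAMBDA_E_ECC + ROWS * per_row


def crossover_hours(lo=0.0, hi=1.0e6, iterations=60):
    """Bisect for the time at which the rising ECC hazard reaches the NR hazard."""
    target = hazard_nr()
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if hazard_ecc(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


mttf_nr = 1.0 / hazard_nr()
t_cross = crossover_hours()
print(f"NR MTTF: {mttf_nr:,.0f} hours")
print(f"ECC hazard stays below NR hazard for {t_cross / HOURS_PER_YEAR:.2f} years")
```

Under these assumptions the NR MTTF comes out at about 36,153 hours and the hazard curves cross after roughly 2.7 years, in line with the figures quoted in the text.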

4.4.  Summary 

The models developed in Chapter 3 and analyzed in this section form a set of easily
applied tools which can help in evaluating memory system design spaces. One indicator alone
is often not enough, as demonstrated in the previous subsection.

ECC memories are not inherently more reliable than nonredundant ones. With very
reliable memory chips the limiting factor on reliability is the control circuitry. When using
standard SSI/MSI logic, Hamming code control circuitry has a failure rate several times that of
the control circuitry for an equivalent NR memory. Block coded memory, which needs less
complex control circuitry, is more reliable than Hamming code memory. Using more reliable
LSI logic for ECC control would greatly improve the total ECC memory reliability.
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Figure  7 

5.  Performance  Effects  of  ECC 


The use of ECC memory for main memory or microstore affects system performance.
Since in most cases error checking can be carried out in parallel with the use of the data,
there will usually be no performance change in an error-free state. This is possible if no
irreversible actions (e.g. overwriting information needed to restart the current operation)
are taken before the error checking has been completed, and if the hardware has
stall/restart capabilities. Most processor/main memory systems and vertically coded
microemulators belong in this class. As a counterexample, a horizontally microcoded machine
with a short microcycle and a very long word width would not allow this, as the propagation
time through the several decoding tree levels required would be greater than the microcycle
time. This case shouldn't occur very frequently, however. Thus this section focuses on the
effect which recoverable errors in memory have on system performance.

5.1. Main Memory Performance in the Presence of Errors

First, assume that every word in a w word memory is equally likely to be accessed. If the
access time of the memory is c, the amount of additional time required to correct an error is
$\epsilon c$. When there are errors in n different words in a Hamming coded memory, the expected
memory access time is

$$\left(1 - \frac{n}{w}\right) c + \frac{n}{w}\,(c + \epsilon c) = c\left(1 + \frac{n\epsilon}{w}\right) \tag{5-1}$$

since the probability of an error in a given word is n/w. In the case of block codes, errors in
a still-functioning memory are distributed in such a way that there is no more than one error
per block. Thus the probability of an error occurring in any given logical word (j words per
block) is

$$P = \Pr[\text{error in block}] \times \Pr[\text{error in word} \mid \text{error in block}] = \frac{n}{w/j} \cdot \frac{1}{j} = \frac{n}{w} \tag{5-2}$$

so that equation (5-1) still holds.


Next assume that the access frequency is not uniform throughout the memory, so that
some memory segments, such as those containing parts of the operating system kernel, are
more likely to be accessed than others. Suppose that each location i has access probability
$p_i$, and that there are n errors in memory. The expected memory access time can be
expressed as the weighted sum

$$\sum_i p_i \left(1 - \frac{n}{w}\right) c + \sum_i p_i\,\frac{n}{w}\,(c + \epsilon c) = \left(c + \frac{n}{w}\,\epsilon c\right) \sum_i p_i$$

which reduces to equation (5-1) as well since

$$\sum_i p_i = 1.$$

Thus in both cases the expected degradation of the memory access time is $n\epsilon/w$.
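
The claim that non-uniform access leaves the expected degradation at $n\epsilon/w$ can be checked by brute force on a toy memory. The sketch below is only an illustration: the eight access probabilities are hypothetical, and the n bad words are assumed to be equally likely to fall anywhere in memory.

```python
from itertools import combinations


def mean_degradation(p, n, eps):
    """Average, over all equally likely placements of n bad words, of the
    fractional increase in expected access time for access probabilities p."""
    placements = list(combinations(range(len(p)), n))
    total = 0.0
    for bad in placements:
        # Each reference to a bad word costs eps extra access times.
        total += eps * sum(p[i] for i in bad)
    return total / len(placements)


# Hypothetical, strongly skewed access probabilities for an 8 word memory.
p = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]
n, eps = 2, 128

print(mean_degradation(p, n, eps))   # averaged over the 28 placements
print(n * eps / len(p))              # n * eps / w from equation (5-1)
```

Any single placement can be much better or worse than the average (a bad word at the most heavily used location dominates), which is the motivation for looking at the full distribution of degradation rather than only its mean.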

The effects of errors on memory access time are illustrated in Table 5-1 for several
values of n and w. Two types of ECC memory are represented: a Hamming code memory
with an $\epsilon$ of 1, and a block coded memory with an $\epsilon$ of 128 due to the necessity of reading all
of the words in the block to determine the vertical parity. The performance degradation is
negligible (< 1%) for the Hamming code, while the degradation becomes significant for the
block code only when n becomes large.
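
The entries of Table 5-1 follow directly from equation (5-1). A minimal sketch, with hypothetical function names, that regenerates them:

```python
def access_degradation(n, w, eps):
    """Fractional increase n*eps/w in expected access time, equation (5-1)."""
    return n * eps / w


def expected_access_time(c, n, w, eps):
    """Expected access time when n of the w words need eps*c extra to correct."""
    p_error = n / w
    return (1.0 - p_error) * c + p_error * (c + eps * c)


K = 1024
for n in (1, 4, 10, 100):
    for eps in (1, 128):          # 1 for Hamming, 128 for the block code
        small = access_degradation(n, 16 * K, eps)
        large = access_degradation(n, 128 * K, eps)
        print(f"n={n:<4d} eps={eps:<4d} 16K: {small:.2g}  128K: {large:.2g}")
```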


The degradation of system performance depends on how often the memory is accessed.
A system with a low memory bandwidth utilization will exhibit less degradation than one
where the bandwidth is almost saturated. A comparison of three different PDP-11 systems
serves as a good example. The data in Table 5-2 are drawn from [Snow77] and are the
result of dynamic measurements of PDP-11 programs. Another result from the same source
is that an average of 2.3166 memory references are caused for each instruction. If $T_m$ is the
memory access time, $T_i$ the average instruction execution time, and D the expected memory
access time degradation, the expected system degradation $D_s$ is

$$D_s = \frac{D\, T_m\, (2.3166)}{T_i}$$
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TABLE  5-1 

Normalized Expected Memory Access Degradation
as a Function of Memory Size (w words),
Number of Failures (n),
and ECC ($\epsilon$ = 1 for Hamming, $\epsilon$ = 128 for Block Code)

  n      ε       16K words    128K words

  1      1       0.00006      0.0000076
         128     0.0078       0.00098
  4      1       0.0002       0.000031
         128     0.031        0.004
  10     1       0.0006       0.000076
         128     0.078        0.0098
  100    1       0.006        0.00076
         128     0.78         0.098

TABLE 5-2

Timing Data for PDP-11 Computer Systems

                  time in microseconds for:
system            memory access    avg. instruction

LSI-11            .400             5.883
PDP-11/10         .600             4.096
PDP-11/34         .940             3.129

The data in Table 5-3 result from this expression. Even when there is severe (10%)
memory degradation, the system degradation is negligible except for the PDP-11/34 system,
whose processor comes close to saturating its processor-memory bandwidth. Therefore,
even though the memory performance degradation is more serious for block codes than
Hamming codes as shown in Table 5-1, the overall system performance would be comparable
over wide ranges of failure situations.
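
The system degradation figures can be recomputed from the timing data of Table 5-2 and the 2.3166 references per instruction quoted from [Snow77]. The sketch below is illustrative; the dictionary and function names are hypothetical.

```python
REFS_PER_INSTRUCTION = 2.3166   # from [Snow77], as quoted in the text

# (memory access time, average instruction time) in microseconds, Table 5-2
SYSTEMS = {
    "LSI-11": (0.400, 5.883),
    "PDP-11/10": (0.600, 4.096),
    "PDP-11/34": (0.940, 3.129),
}


def system_degradation(d, t_mem, t_instr):
    """Expected slowdown per instruction for memory access degradation d."""
    return d * t_mem * REFS_PER_INSTRUCTION / t_instr


for d in (0.0001, 0.001, 0.01, 0.1, 1.0):
    row = [system_degradation(d, *SYSTEMS[name]) for name in SYSTEMS]
    print(d, ["%.3g" % value for value in row])
```

The factor multiplying d is the fraction of instruction time spent in memory references, roughly 0.16, 0.34, and 0.70 for the three machines, which is why only the PDP-11/34 is sensitive to memory slowdown.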

TABLE 5-3

Instruction Degradation for Various Models of
the PDP-11 Assuming Various Amounts of
Memory Access Degradation

Memory Cycle                 Instruction degradation
Time Degradation     LSI-11       PDP-11/10    PDP-11/34

0.0001               0.0000158    0.000034     0.00007
0.001                0.000158     0.00034      0.0007
0.01                 0.00158      0.0034       0.007
0.1                  0.0158       0.034        0.07
1.0                  0.158        0.34         0.7

5.2. Microstore Performance in the Presence of Errors

Microstore reliability is becoming more important as the use of microcoded system design
is increasing. The growing size of microstores being used and the subsequent effect on
system reliability makes error coding techniques more attractive. Unlike main memory, where
very degraded segments can be left unallocated, degraded sections of
microcode are permanently allocated and will continue to affect system performance until
they can be repaired. For these reasons system performance degradation in the presence of
microstore errors is an important issue.

5.2.1. A Model for Performance Degradation

A simplified view of a microcoded machine is outlined in Table 5-4. It is assumed that all F
fetch and S service microwords are executed during each macrocycle. The expected
macrocycle time $M_0$ with no errors present is

$$E[M_0] = \left(F + S + \sum_{j=1}^{a} A_j p_j + \sum_{k=1}^{i} I_k p_k\right) m = (F + S + \bar{A} + \bar{I})\, m$$

where m is the microcycle time, and $\bar{A}$ and $\bar{I}$ abbreviate the two sums.

In formulating the performance degradation model two further assumptions are made.
The first is that the probability distribution of errors is uniform over all memory words. The
second is that an error code with one logical word per physical word is being used. If the
excess time needed to correct a word with an error is $\epsilon m$ and there are n errors in the
memory, the expected macrocycle time is

$$E[M_n] = \left[F + S + \bar{A} + \bar{I} + \left(F\frac{n}{w} + S\frac{n}{w} + \sum_{j=1}^{a} p_j A_j \frac{n}{w} + \sum_{k=1}^{i} p_k I_k \frac{n}{w}\right)\epsilon\right] m$$

since the expected number of errors in $\ell$ words is $\ell n/w$. This reduces to

$$E[M_n] = E[M_0] + \frac{n}{w}\,(F + S + \bar{A} + \bar{I})\,\epsilon m = E[M_0]\left(1 + \frac{n\epsilon}{w}\right) \tag{5-3}$$

Thus the expected performance degradation is $n\epsilon/w$, as with main memory.


TABLE 5-4

Microstore Model - Allocation and Access Frequency

Purpose              Size     P[access]

fetch                F        1
interrupt service    S        1
addressing mode      A_j      p_j
instruction          I_k      p_k

total memory in microstore:  w = F + S + Σ A_j + Σ I_k


In the case of a block code of j words per block, the expected number of errors in $\ell$
words is

$$\sum_{i=1}^{\ell} \frac{n}{w/j} \cdot \frac{1}{j} = \sum_{i=1}^{\ell} \frac{n}{w} = \frac{\ell n}{w}$$

(see equation (5-2)), assuming that there are n errors in a functioning microstore. Thus
equation (5-3) still holds.

Computers with microstores of 256, 1024, and 4096 words are considered as examples.
The performance degradation expected with n = 1, 2, and 3 errors present is presented in
Table 5-5 for the cases where $\epsilon$ = 1 (Hamming code) and $\epsilon$ = 16 (block code, 16 words per
block). The expected degradation is negligible for the Hamming code case. For the block
coded case the degradation is negligible when the block size is small in relation to the
memory size.
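
The reduction to $n\epsilon/w$ can also be seen by summing the model region by region. The sketch below is an illustration under stated assumptions: it uses the region sizes of the 256 word example microstore (3 fetch words, 10 service words, 16 addressing modes and 65 instructions of 3 words each), but the access probabilities are hypothetical (all modes and all instructions equally likely), and the function name is invented for the example.

```python
def expected_macrocycle(regions, n, w, eps):
    """Expected macrocycle length in microcycles with n bad words.

    regions is a list of (size_in_words, access_probability) pairs."""
    total = 0.0
    for size, p_access in regions:
        bad_words = size * n / w          # expected bad words in this region
        total += p_access * (size + eps * bad_words)
    return total


F, S, A, I = 3, 10, 3, 3
a, i = 16, 65
regions = [(F, 1.0), (S, 1.0)]
regions += [(A, 1.0 / a)] * a             # hypothetical: equally likely modes
regions += [(I, 1.0 / i)] * i             # hypothetical: equally likely instructions
w = F + S + a * A + i * I                 # 256 microwords

base = expected_macrocycle(regions, 0, w, 1)
for eps in (1, 16):
    for n in (1, 2, 3):
        ratio = expected_macrocycle(regions, n, w, eps) / base - 1.0
        print(f"eps={eps:2d} n={n}: {ratio:.4f} (n*eps/w = {n * eps / w:.4f})")
```

The per-region sum and the closed form agree for any choice of access probabilities, because every region picks up the same factor of 1 + n*eps/w.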

TABLE 5-5

Expected Macrocycle Degradation in the Presence of Errors

                       number of errors
  w       ε        1          2          3

  256     1        0.0039     0.0078     0.0117
          16       0.0625     0.125      0.1875
  1024    1        0.0010     0.0020     0.0029
          16       0.0156     0.0313     0.0469
  4096    1        0.0002     0.0005     0.0007
          16       0.0039     0.0078     0.0117

5.2.2. Distribution of Performance Degradation


Given  a machine  such  as  the  one  outlined  in  Table  5-4,  the  probability  distribution  of  the 
performance  degradation  with  n errors  present  can  be  computed.  The  locations  of  the  n 
errors  in  microstore  can  be  represented  by  the  vector  f,  which  has  one  element  for  each  of 
the  fetch  and  service  areas  as  well  as  for  each  of  the  instructions  and  addressing  modes. 
Thus  f has  (2+a+i)  elements  which  sum  to  n.  The  degradation  probability  distribution  can  then 
be  computed  using  the  formula 


$$\Pr[\mathbf{f}] = \frac{\binom{F}{f_F}\binom{S}{f_S}\prod_{j=1}^{a}\binom{A_j}{f_{A_j}}\prod_{k=1}^{i}\binom{I_k}{f_{I_k}}}{\binom{w}{n}} \tag{5-4}$$

for the probability of a given error vector f. The expected performance degradation
associated with the combination of this vector over all addressing modes and instructions is

$$\sum_{j=1}^{a}\sum_{k=1}^{i} (f_F + f_S + f_{A_j} + f_{I_k})\, p_j\, p_k\, \epsilon \tag{5-5}$$

which gives the expected degradation in terms of microstore cycle time. This quantity is
evaluated for each valid f and the results compiled to give the probability distribution for
performance degradation with n errors present.

22 May 1978
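
If the n bad words are distinct and equally likely to fall anywhere in the microstore, the probability of a particular error vector is a multivariate hypergeometric ratio: the number of ways to choose the required bad words in each region, divided by the number of ways to choose n words from all w. The sketch below illustrates that counting argument under exactly those assumptions; the function name is hypothetical, and regions are lumped into two groups for brevity.

```python
from math import comb


def error_vector_probability(region_sizes, f):
    """Probability that n = sum(f) distinct, uniformly placed bad words
    fall into the regions with exactly f[r] of them in region r."""
    n = sum(f)
    w = sum(region_sizes)
    ways = 1
    for size, count in zip(region_sizes, f):
        ways *= comb(size, count)
    return ways / comb(w, n)


# One error: fetch plus service (13 words) versus the other 243 words.
print(error_vector_probability([13, 243], [1, 0]))

# Two errors, both inside a 108 word group of instructions,
# with the remaining 148 words error free.
print(error_vector_probability([148, 108], [0, 2]))
```

With a single error the expression collapses to the region size divided by w. For the two error case shown, the result is about 0.177, which coincides with the 0.1770 entry of Table 5-9.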

As an example, consider the case of one microstore error in the machine described in
Tables 5-6 and 5-7, which detail a simple model of the PDP-11 based on [Snow77]. In this
example, addressing modes with approximately equal access probabilities are grouped into
classes. Instructions are treated in the same way. With only one error present, equation
(5-4) for the probability of the vector f occurring reduces to

$$\frac{R_x}{w}$$

where the nonzero element of f is the xth element, and the section of code containing the
error is represented by $R_x$ (i.e. $R_x$ corresponds to one of F, S, $A_j$, or $I_k$). The application
of formula (5-5) also simplifies considerably since only one functional area of the microcode
can have an error in it. The resulting probability distribution of the performance degradation
is listed in Table 5-8.

TABLE 5-6

Microstore Specifications

F = 3
S = 10
A_j = 3 for all j
I_k = 3 for all k
a = 16
i = 65
w = 256

The probability of negligible (< 1%) degradation is 93%. The probability that the
degradation is less than the expected degradation (.0039 from Table 5-5) is 86%. The
probability of noticeable degradation (> 5%) is only 5%, while severe degradation does not
occur.
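
The single-error figures quoted above can be approximated with a short calculation. The sketch below is a simplification, not the report's exact procedure: it takes the degradation caused by a bad word to be $\epsilon$ times the access probability of its region, divided by the 19 microcycles of an error-free macrocycle (F + S + A_j + I_k), and the probability of each case to be the region size over w. All names are hypothetical. Under this simplification the probabilities match Table 5-8 and the degradations agree with it to within a few percent.

```python
EPS = 1                      # Hamming code: one extra microcycle per bad word
F, S, A, I = 3, 10, 3, 3     # Table 5-6 sizes in microwords
CYCLES = F + S + A + I       # microcycles in an error-free macrocycle (19)

# (access probability, number of modes or instructions in class), Table 5-7
ADDRESSING = [(0.3, 1), (0.15, 2), (0.08, 2), (0.05, 3), (0.03, 4), (0.0, 4)]
INSTRUCTIONS = [(0.2, 1), (0.08, 2), (0.03, 10), (0.015, 16), (0.003, 36)]


def single_error_distribution():
    """Return (degradation, probability) pairs for one bad microword."""
    regions = [(1.0, F + S)]     # fetch and service words run every macrocycle
    regions += [(p, count * A) for p, count in ADDRESSING]
    regions += [(p, count * I) for p, count in INSTRUCTIONS]
    w = sum(words for _, words in regions)
    return [(p * EPS / CYCLES, words / w) for p, words in regions]


def prob_below(dist, limit):
    """Probability that the degradation is strictly below limit."""
    return sum(prob for degradation, prob in dist if degradation < limit)


dist = single_error_distribution()
print(f"P[degradation < 1%]       = {prob_below(dist, 0.01):.3f}")
print(f"P[degradation < expected] = {prob_below(dist, 1 / 256):.3f}")
print(f"P[degradation > 5%]       = {1 - prob_below(dist, 0.05):.3f}")
```

The three printed probabilities round to the 93%, 86%, and 5% quoted in the text.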

Table 5-9 contains the probability distribution for the above machine when two errors
are present. Although there is a possibility of severe degradation, the probability is small
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TABLE  5-7 

Assumed Dynamic Distributions of
Addressing Modes and Instructions

Addressing                 Number
Class          p_j         in Class

1              0.3         1
2              0.15        2
3              0.08        2
4              0.05        3
5              0.03        4
6              0.0         4

Instruction                Number
Class          p_k         in Class

1              0.2         1
2              0.08        2
3              0.03        10
4              0.015       16
5              0.003       36

TABLE 5-8

Probability Distribution of Performance Degradation
(one error present)

Degradation     Probability

0.0526          0.0508
0.0159          0.0117
0.0108          0.0117
0.00796         0.0234
0.00434         0.0234
0.00424         0.0234
0.00265         0.0352
0.00163         0.117
0.00159         0.0469
0.000813        0.188
0.000163        0.422
0               0.0469

[...] probability that the degradation will be less than 1%.

This same method was used to derive the probability distributions of degradation for a
block coded microstore with the characteristics given in Table 5-10. The resulting
distributions are presented in Tables 5-11 and 5-12. The dynamic distributions of Table 5-7
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TABLE  5-9 

Probability Distribution of Performance Degradation
(two errors present)

Degradation     Probability

.105            .0024
.060-.068       .0048
.052-.057       .0917
.032            .0001
.020-.024       .0013
.016-.019       .0214
.011-.015       .0204
.005-.010       .0630
.002-.004       .1634
.001-.002       .3741
.0003           .1770
.0              .0780

were  used  for  this  example  as  well. 


TABLE 5-10

Microstore Specifications

F = 4
S = 12
A_j = 4 for all j
I_k = 4 for all k
a = 16
i = 64
w = 336
16 words per block


The performance degradation for the block code microstore is more severe than for the
Hamming coded microstore. With one error present, the probability of severe degradation
(greater than 10%) is about 8%, while the probability of negligible degradation (1% or less) is
only 65%. When there are two errors present, the chance of a severe performance loss is
17% and that of a benign failure drops to 40%.
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TABLE  5-11 

Probability Distribution for Degradation
With a Block Coded Microstore
(one error present)

Degradation     Probability

0.667           0.0476
0.2             0.0119
0.133           0.0119
0.1             0.0238
0.053           0.0595
0.033           0.0357
0.02            0.1666
0.01            0.1905
0.002           0.4167
0.0             0.0476

TABLE 5-12

Probability Distribution of Performance Degradation
(two errors present)

Degradation     Probability

.867            .0011
.800            .0011
.767            .0023
.667-.720       .0864
.300            .0006
.153-.253       .0245
.100-.143       .0563
.073-.087       .0090
.053-.064       .0710
.030-.040       .1080
.020-.025       .2314
.012            .1592
.0004           .1547
0.0             .0773

6.  Conclusions 


The way in which memory chips fail affects the reliability of single error correcting
memories. It also dictates the choice of models for memory system reliabilities. When the
dominant failure mode, chip failure rate, and control failure rate are known, the models
presented can be used in making tradeoff analyses in memory system design.

Error correcting code memories are not automatically more reliable than equivalent
nonredundant memories, as the limiting factor is the reliability of the control circuitry. For
the same reason block coded memories tend to be more reliable than Hamming coded
memories. An increase in the reliability of the control circuitry will bring about a
corresponding increase in ECC memory system reliability, which is a strong argument for the
use of LSI control circuitry.

When error checking and use of data proceed in parallel, error correcting memories can have
performance similar to nonredundant memories when no failures are present. In the majority
of cases the performance of systems containing error correcting code memories experiences
negligible degradation in the presence of failures. Block coded memories, which are more
reliable than Hamming coded memories, experience more performance degradation when
errors are present, although the degradation is still negligible in most cases.


I.  Proof  of  the  MTTF  Formulae 




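To derive the iterative formula for $\mathrm{MTTF}_{sb}$ presented in this paper, the integral
expression for $\mathrm{MTTF}_{sb}$ is evaluated in the following manner:

$$\mathrm{MTTF}_{sb} = \int_0^\infty e^{-\lambda_e t}\left(k\,e^{-(k-1)\lambda_b t} - (k-1)\,e^{-k\lambda_b t}\right)^{w} dt
= \int_0^\infty e^{-\lambda_e t}\, e^{-\lambda_b (k-1) w t}\left(k - (k-1)\,e^{-\lambda_b t}\right)^{w} dt$$

The next step is to make the substitution

$$x = e^{-\lambda_b t}, \qquad dx = -\lambda_b\, e^{-\lambda_b t}\, dt, \qquad x\big|_{t=\infty} = 0, \quad \text{and} \quad x\big|_{t=0} = 1.$$

To further simplify the integral, let

$$m = (k-1)\,w + \lambda_e/\lambda_b - 1, \qquad n = w, \qquad a = k, \qquad \text{and} \quad b = -(k-1).$$

The integral becomes

$$\mathrm{MTTF}_{sb} = -\frac{1}{\lambda_b}\int_1^0 x^m (a + bx)^n\, dx,$$

which has the recursive solution

$$\mathrm{MTTF}_{sb} = -\frac{1}{\lambda_b}\left(\left[\frac{x^{m+1}(a+bx)^{n}}{m+n+1}\right]_1^0 + \frac{an}{m+n+1}\int_1^0 x^m (a+bx)^{n-1}\, dx\right).$$

After doing one more recursion, the equation becomes

$$\mathrm{MTTF}_{sb} = -\frac{1}{\lambda_b}\left(\left[\frac{x^{m+1}(a+bx)^{n}}{m+n+1}\right]_1^0 + \frac{an}{m+n+1}\left(\left[\frac{x^{m+1}(a+bx)^{n-1}}{m+n}\right]_1^0 + \frac{a(n-1)}{m+n}\int_1^0 x^m (a+bx)^{n-2}\, dx\right)\right).$$

More simplifications are now introduced. Let

$$f_i = (m+n+1) - i \;\; (= wk + \lambda_e/\lambda_b - i), \qquad g_i = n - i \;\; (= w - i), \qquad \text{and} \quad y = a + bx.$$

With some rearranging the $\mathrm{MTTF}_{sb}$ equation reduces to

$$\mathrm{MTTF}_{sb} = -\frac{1}{\lambda_b}\left(\left[\frac{x^{m+1} y^{g_0}}{f_0}\right]_1^0 + \frac{a g_0}{f_0}\left(\left[\frac{x^{m+1} y^{g_1}}{f_1}\right]_1^0 + \frac{a g_1}{f_1}\int_1^0 x^m y^{g_2}\, dx\right)\right).$$

The final term in the recursion is

$$\frac{a\, g_{n-1}}{f_{n-1}}\int_1^0 x^m y^{g_n}\, dx = \frac{a\, g_{n-1}}{f_{n-1}}\left[\frac{x^{m+1}}{m+1}\right]_1^0,$$

where $g_n = 0$ and $f_n = m + 1$. Thus, $x^{m+1}$ can be factored out, giving

$$\mathrm{MTTF}_{sb} = -\frac{1}{\lambda_b}\left[x^{m+1}\left(\frac{y^{g_0}}{f_0} + \frac{a g_0}{f_0}\left(\frac{y^{g_1}}{f_1} + \frac{a g_1}{f_1}\left(\cdots + \frac{a\, g_{n-1}}{f_{n-1}}\cdot\frac{1}{f_n}\right)\cdots\right)\right)\right]_1^0.$$

When x = 0, $x^{m+1} = 0$, while at x = 1, $x^{m+1} = 1$ and

$$y^{g_j} = (k - (k-1))^{g_j} = 1,$$

giving

$$\mathrm{MTTF}_{sb} = \frac{1}{\lambda_b}\left(\frac{1}{f_0} + \frac{a g_0}{f_0}\left(\frac{1}{f_1} + \frac{a g_1}{f_1}\left(\cdots + \frac{a\, g_{n-1}}{f_{n-1}}\cdot\frac{1}{f_n}\right)\cdots\right)\right).$$

A final reorganization yields the formula presented in equation (3-2), namely

$$\mathrm{MTTF}_{sb} = \frac{1}{\lambda_b}\left(\frac{1}{f_0} + \frac{a\, g_0}{f_0 f_1} + \cdots + \frac{a^n\, g_0 \cdots g_{n-1}}{f_0 \cdots f_n}\right).$$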

An important point to note is that in solving the integral m is assumed to be an integer,
which in turn constrains $\lambda_e/\lambda_b$ to also be an integer. In almost all cases this constraint is not
a problem, because normally $\lambda_e \gg \lambda_b$.

The derivation of the iterative formulae for $\mathrm{MTTF}_{wc}$ and $\mathrm{MTTF}_{r}$ follows the same route as
that for $\mathrm{MTTF}_{sb}$, with only the few parameter changes noted in sections 3.2.2 and 3.2.3.
Depending on the implementation, the integer constraint for $\lambda_e/\lambda_c$ may be a problem, but
again, for most cases it should not be.
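
The series form of the MTTF is straightforward to evaluate term by term. The sketch below is an illustration rather than the report's own program: it accumulates the series with a running product and cross-checks it against a direct binomial expansion of the integrand, which is only numerically safe for small exponents. The example values are assumptions: the whole-chip case is modeled with the chip failure rate in place of the bit failure rate and with the exponent set to two rows of 72 chips, using the Section 4.3 rates taken as failures per million hours. Function names are hypothetical.

```python
from math import comb


def mttf_series(k, n, lam, lam_e):
    """Sum (1/lam) * (1/f0 + a*g0/(f0*f1) + ...) with a = k, g_i = n - i,
    and f_i = n*k + lam_e/lam - i."""
    f = n * k + lam_e / lam          # f_0
    term = 1.0 / f
    total = term
    for i in range(n):
        f -= 1.0                     # advance to f_{i+1}
        term *= k * (n - i) / f      # multiply by a * g_i / f_{i+1}
        total += term
    return total / lam


def mttf_expanded(k, n, lam, lam_e):
    """Expand (k e^{-(k-1) lam t} - (k-1) e^{-k lam t})^n binomially and
    integrate each exponential term (suitable for small n only)."""
    total = 0.0
    for j in range(n + 1):
        coeff = comb(n, j) * k ** (n - j) * (-(k - 1)) ** j
        rate = lam_e + lam * ((k - 1) * (n - j) + k * j)
        total += coeff / rate
    return total


series = mttf_series(72, 2, 0.2e-6, 20.39e-6)
expanded = mttf_expanded(72, 2, 0.2e-6, 20.39e-6)
print(f"series: {series:,.0f} hours, expansion: {expanded:,.0f} hours")
```

Both routes give roughly 35,800 hours for these inputs, matching the ECC figure quoted in Section 4.3. The running product avoids the large alternating terms of the expansion, which is what makes the series form practical when the exponent is in the thousands.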


II.  Derivation  of  the  Support  Circuitry  Failure  Rates 


The  support  circuit  failure  rates  in  Table  4-2  for  nonredundant,  Hamming  code,  and  block 
code  memories  are  derived  from  models  of  the  support  circuitry  required  for  the  "no-frills" 
memory  systems  shown  in  Figure  1.  These  models  provide  rough  estimates  of  the  number 
and  type  of  standard  SSI/MSI  TTL  packages  needed  without  necessitating  actual  circuit 
design. 

TABLE B-1

Number of IC's in Support Circuitry

b bits/word, w words in memory
k bits/physical word, j words/physical word

chip type              number of chips

(i) nonredundant memory

random logic           10
bus xcvrs              ⌈b/4⌉
latches                ⌈(log2 w)/4⌉

(ii) Hamming code memory

random logic           30
bus xcvrs              ⌈b/4⌉
latches                ⌈(log2 w)/4⌉ + ⌈k/4⌉
parity trees           (⌈log2 k⌉ * ⌈(⌈k/2⌉ * 10)/81⌉) + ⌈(k-1) * 10/81⌉
comparators            ⌈(2 + ⌈log2(j*b)⌉) * 5/4⌉
XOR                    ⌈k/4⌉
inverter               ⌈k/6⌉
4->16 DEMUX            ⌈k/16⌉
2->1 MUX               ⌈k/4⌉ (*2 iff j = 2)
8->1 tristate MUX      k*⌈j/8⌉ iff j > 16
16->1 MUX              k iff 16 ≥ j > 8
8->1 MUX               k iff 8 ≥ j > 4
4->1 MUX               ⌈k/2⌉ iff 4 ≥ j > 2

(iii) block code memory

random logic           30
bus xcvrs              ⌈k/4⌉
latches                ⌈(⌈log2 w⌉)/4⌉ + ⌈(b+1)/4⌉
parity trees           ⌈b * 10/81⌉
XOR                    ⌈(b+1)/4⌉


The formulae used in computing numbers of packages are shown in Table B-1. For the
overhead associated with control of the entire memory, an arbitrary number of "average" size
chips (15 gates) was chosen. Ten such chips are used in the nonredundant memory model
and thirty are used in each ECC memory model. These numbers are only "order of
magnitude" guesses, but the inaccuracy involved is not enough to affect the conclusions
drawn from these models since the major proportion of the support circuitry is in the data
paths and data operators.

Failure rates for the integrated circuits are calculated using [DOD76] and the following
assumptions:

$\pi_Q$ = 16 (class C)
$T$ = 40 °C
$\pi_E$ = 0.2 (ground benign)
$\pi_L$ = 1.0 (mature technology)

The resulting failure rates are listed in Table B-2.


TABLE B-2
Chip Failure Rates

chip type              model     gates*    λ

random logic           -         15        .077
latch                  74175     24        .099
9 bit parity decode    74280     46        .230
comparator             7485      31        .181
4->16 DEMUX            74154     25        .103
inverter               7464      4         .048
XOR                    7486      4         .039
2->1 MUX               74157     15        .077
4->1 MUX               74153     16        .082
8->1 MUX               74151     14        .074
8->1 MUX (tristate)    74251     17        .084
16->1 MUX              74150     25        .103
bus xcvr               -         8         .056

*obtained from [DOD76] when possible

These models do not necessarily provide accurate support circuitry failure rates of actual
memories. However, they do show well the relative effects of the different coding schemes
as well as of different memory parameters (e.g. word size, block size, number of words in
memory) on the support circuitry reliability. Thus they can be used as in Chapter 4 to
demonstrate the relative effects of the support and memory array circuitry on memory
system reliability.
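
As a worked illustration of how the package counts and chip failure rates combine, the sketch below evaluates the nonredundant case only, using the three nonredundant rows of Table B-1 and the matching rates from Table B-2. It assumes the rates are simply summed over packages and are expressed in failures per million hours; the function and dictionary names are hypothetical.

```python
from math import ceil, log2

# Failure rate per package, from Table B-2.
RATES = {"random logic": 0.077, "bus xcvr": 0.056, "latch": 0.099}


def nonredundant_packages(b, w):
    """Package counts for a b bit, w word nonredundant memory (Table B-1 (i))."""
    return {
        "random logic": 10,
        "bus xcvr": ceil(b / 4),
        "latch": ceil(log2(w) / 4),
    }


def support_failure_rate(packages):
    """Total support circuitry failure rate, summing over all packages."""
    return sum(count * RATES[chip] for chip, count in packages.items())


packages = nonredundant_packages(b=64, w=8192)
print(packages)
print(f"support circuitry failure rate: {support_failure_rate(packages):.3f}")
```

For a 64 bit, 8192 word memory this gives 10 random logic packages, 16 bus transceivers, and 4 latches, for a total of about 2.06, which agrees with the NR control figure used in Table 4-3.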
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